Supervised Classification Using Balanced Training

نویسندگان

  • Mian Du
  • Matthew Pierce
  • Lidia Pivovarova
  • Roman Yangarber
چکیده

We examine supervised learning for multi-class, multi-label text classification. We are interested in exploring classification in a realworld setting, where the distribution of labels may change dynamically over time. First, we compare the performance of an array of binary classifiers trained on the label distribution found in the original corpus against classifiers trained on balanced data, where we try to make the label distribution as nearly uniform as possible. We discuss the performance tradeoffs between balanced vs. unbalanced training, and highlight the advantages of balancing the training set. Second, we compare the performance of two classifiers, Naive Bayes and SVM, with several feature-selection methods, using balanced training. We combine a Named-Entity-based rote classifier with the statistical classifiers to obtain better performance than either method alone.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using statistical text classification to identify health information technology incidents

OBJECTIVE To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. DESIGN We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. ...

متن کامل

Semi-supervised Learning from Unbalanced Labeled Data - An Improvement

We present a great improvement while performing semi-supervised learning tasks from training data sets when only a small fraction of the data pairs is labeled. In particular, we propose a novel decision strategy based on normalized model outputs. We give the explanation why the normalization step helps. The paper compares performances of two popular semi-supervised approaches (Consistency Metho...

متن کامل

Bootstrapping Supervised Machine-learning Polarity Classifiers with Rule-based Classification

In this paper, we explore the effectiveness of bootstrapping supervised machine-learning polarity classifiers using the output of domain-independent rule-based classifiers. The benefit of this method is that no labeled training data are required. Still, this method allows to capture in-domain knowledge by training the supervised classifier on in-domain features, such as bag of words. We investi...

متن کامل

Exploiting Unlabeled Texts with Clustering-based Instance Selection for Medical Relation Classification

Classifying relations between pairs of medical concepts in clinical texts is a crucial task to acquire empirical evidence relevant to patient care. Due to limited labeled data and extremely unbalanced class distributions, medical relation classification systems struggle to achieve good performance on less common relation types, which capture valuable information that is important to identify. O...

متن کامل

Classification of ECG signals using Hermite functions and MLP neural networks

Classification of heart arrhythmia is an important step in developing devices for monitoring the health of individuals. This paper proposes a three module system for classification of electrocardiogram (ECG) beats. These modules are: denoising module, feature extraction module and a classification module. In the first module the stationary wavelet transform (SWF) is used for noise reduction of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014